#Imports
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
#plotly.plotly was deprecated and moved to the chart_studio package in newer Plotly releases
import chart_studio.plotly as py
import plotly.graph_objs as go
import warnings
warnings.filterwarnings('ignore')
In this competition you are asked to predict the forest cover type (the predominant kind of tree cover) from strictly cartographic variables (as opposed to remotely sensed data). The actual forest cover type for a given 30 x 30 meter cell was determined from US Forest Service (USFS) Region 2 Resource Information System data. Independent variables were then derived from data obtained from the US Geological Survey and USFS. The data is in raw form (not scaled) and contains binary columns of data for qualitative independent variables such as wilderness areas and soil type.
This study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. These areas represent forests with minimal human-caused disturbances, so that existing forest cover types are more a result of ecological processes rather than forest management practices.
Natural resource managers responsible for developing ecosystem management strategies require basic descriptive information including inventory data for forested lands to support their decision-making processes. However, managers generally do not have this type of data for inholdings or neighboring lands that are outside their immediate jurisdiction. One method of obtaining this information is through the use of predictive models.
The study area included four wilderness areas found in the Roosevelt National Forest of northern Colorado. A total of twelve cartographic measures were utilized as independent variables in the predictive models, while seven major forest cover types were used as dependent variables. Several subsets of these variables were examined to determine the best overall predictive model.
File descriptions:
Data description:
Predicting forest cover type from cartographic variables only (no remotely sensed data). The actual forest cover type for a given observation (30 x 30 meter cell) was determined from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Independent variables were derived from data originally obtained from US Geological Survey (USGS) and USFS data. Data is in raw form (not scaled) and contains binary (0 or 1) columns of data for qualitative independent variables (wilderness areas and soil types).
Some background information for these four wilderness areas: Neota (area 2) probably has the highest mean elevational value of the 4 wilderness areas. Rawah (area 1) and Comanche Peak (area 3) would have a lower mean elevational value, while Cache la Poudre (area 4) would have the lowest mean elevational value.
As for primary major tree species in these areas, Neota would have spruce/fir (type 1), while Rawah and Comanche Peak would probably have lodgepole pine (type 2) as their primary species, followed by spruce/fir and aspen (type 5). Cache la Poudre would tend to have Ponderosa pine (type 3), Douglas-fir (type 6), and cottonwood/willow (type 4).
The Rawah and Comanche Peak areas would tend to be more typical of the overall dataset than either the Neota or Cache la Poudre, due to their assortment of tree species and range of predictive variable values (elevation, etc.) Cache la Poudre would probably be more unique than the others, due to its relatively low elevation range and species composition.
Number of Attributes: 12 measures, but 54 columns of data (10 quantitative variables, 4 binary wilderness areas and 40 binary soil type variables)
Attribute information:
Given are the attribute name, attribute type, the measurement unit and a brief description. The forest cover type is the classification target. The order of this listing corresponds to the order of numerals along the rows of the database.
Name Data Type Measurement Description
Code Designations:
Wilderness Areas:
Soil Types: 1 to 40 : based on the USFS Ecological Landtype Units (ELUs) for this study area:
Study Code USFS ELU Code Description
Note: First digit: climatic zone Second digit: geologic zones
The third and fourth ELU digits are unique to the mapping unit and have no special meaning to the climatic or geologic zones.
Forest Cover Type Classes:
For further information: https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.info
The Kaggle competition leaderboard can be found here: https://www.kaggle.com/c/sd701-cover-type-prediction-of-forests/leaderboard
The approach I am building in this notebook is a classification model selection. I am using scikit-learn in the Python environment.
As we have separate Train and Test files, we do not have to split the data into Train and Test, and we can train the whole model on the X and y matrices using a simple grid-search cross-validation (GridSearchCV).
However, as the file is pretty large, I have decided to explore a Train Test Split approach on a small sample in order to reduce the number of candidate models. The results are presented in the Train Test Split section.
Once the best models have been identified, I will explore one model further on the whole sample with a larger parameter grid, or build an ensemble model using a hard-voting classifier for example.
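As a preview of that last step, the hard-voting idea can be sketched as follows. This is a minimal, illustrative sketch on synthetic data; the base estimators here are placeholders, not the final selection from the model comparison below.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier

# Illustrative base learners; the real choice comes out of the model selection.
voting = VotingClassifier(
    estimators=[('rf', RandomForestClassifier(random_state=0)),
                ('et', ExtraTreesClassifier(random_state=0)),
                ('dt', DecisionTreeClassifier(random_state=0))],
    voting='hard')  # majority vote over the predicted class labels

# Synthetic stand-in for (X, y); the notebook would fit on the real matrices.
X_demo, y_demo = make_classification(n_samples=300, n_features=12, n_informative=6,
                                     n_classes=3, random_state=0)
voting.fit(X_demo, y_demo)
print(voting.predict(X_demo[:5]))
```

With `voting='hard'` each estimator gets one vote per sample; `voting='soft'` would average predicted probabilities instead.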
NB: I have decided to implement the model in a Jupyter Notebook in Python to be able to try XGBoost and ExtraTrees classification techniques, and because scikit-learn offers a broad range of algorithms. I have also decided to go for some experimental data visualization.
Summary : Feature Engineering and Data Cleaning :
Model Selection :
Selected Model :
I have tested many models, but the whole process was more a way to discover the scikit-learn library than an actual attempt to optimize all of those models.
train_data = pd.read_csv("train-set.csv", header='infer')
train_data = train_data.set_index('Id') #The Id column will be our index
test_data = pd.read_csv("test-set.csv", header='infer')
test_data = test_data.set_index('Id')
train_data.head(10)
#Show the size of our data
print('Train data size : ' + str(train_data.shape))
print('Test data size : ' + str(test_data.shape))
round(train_data.describe(),2)
train_data['Cover_Type'].value_counts()
print("Null Values : ")
print(train_data.isnull().sum())
There seem to be no null values. We just need to check the column types. We notice that Soil_Type is already one-hot encoded, which makes our task easier.
print(train_data.dtypes)
All columns are numeric, so we don't have to encode categorical variables or process string columns. We will now split train_data into X, the features, and y, the target.
train_data.groupby('Cover_Type').size()
y = train_data['Cover_Type']
X = train_data.loc[:, train_data.columns != 'Cover_Type']
print('X shape : ' + str(X.shape))
print('y shape : ' + str(y.shape))
We will take a look at the correlation matrix through a heatmap, which highlights the regions where the columns of X are correlated with one another.
fig, ax = plt.subplots(figsize=(12,8))
ax = sns.heatmap(X.corr(), xticklabels=False, yticklabels=False)
We can focus on the first ten columns, which excludes the one-hot encoded variables.
fig, ax = plt.subplots(figsize=(12,8))
Z = X.iloc[:,0:10]
ax = sns.heatmap(Z.corr())
#The most strongly correlated features are concentrated in the first ten columns, so variables outside this region are the most likely candidates to be dropped.
Listing the most strongly correlated pairs of variables is quite informative in this case:
#Highly correlated pairs of quantitative features
corr_list = []
threshold = 0.3
size = 10
corr = Z.corr()  #compute the correlation matrix once instead of on every iteration
for i in range(0, size):
    for j in range(i+1, size):
        if abs(corr.iloc[i, j]) >= threshold and abs(corr.iloc[i, j]) < 1:
            corr_list.append([corr.iloc[i, j], i, j])
#Print the pairs, strongest correlation first
for v, i, j in sorted(corr_list, key=lambda x: -abs(x[0])):
    print("%s and %s = %.2f" % (Z.columns[i], Z.columns[j], v))
Quite logically, the Hillshade values at different times of day are correlated, as are the horizontal and vertical distances to hydrology, for example. To avoid relying on those redundant pairs, we focus on the correlation between each feature and the Cover_Type column:
#Correlation of each feature with the target
import operator
corr_y_list = {}
for col in train_data.columns:
    corr_y_list[col] = train_data[col].corr(train_data['Cover_Type'])
corr_y_list.pop('Cover_Type', None)
print(dict(sorted(corr_y_list.items(), key=operator.itemgetter(1), reverse=True)[:20]))
If we plot the correlation between each feature and the Cover type, we obtain the following :
#Sort the feature names alphabetically and keep the values aligned with them
#(sorting keys and values independently would mismatch names and correlations)
names = sorted(corr_y_list.keys())
values = [corr_y_list[n] for n in names]
#tick_label does the same work as plt.xticks()
plt.figure(figsize=(18,5))
plt.xticks(rotation=90)
plt.bar(range(len(corr_y_list)),values,tick_label=names)
plt.show()
And if we plot the correlation between the most significant pairs of features :
#Plot each highly correlated pair, coloured by cover type
for v, i, j in corr_list:
    #seaborn renamed the pairplot `size` argument to `height`
    sns.pairplot(train_data, hue="Cover_Type", height=6, x_vars=train_data.columns[i], y_vars=train_data.columns[j])
    plt.show()
We notice quite a few observations aligned at 0 for Hillshade_3pm. This means there are zeros we should take care of in Hillshade_3pm. We can take a look at the correlation between the 9 AM and Noon hillshades where the 3 PM value is zero.
PM_0 = train_data[train_data.Hillshade_3pm==0]
plt.figure(figsize=(12,8))
plt.scatter(PM_0.Hillshade_9am,PM_0.Hillshade_Noon, c=PM_0.Cover_Type)
plt.suptitle('Correlation 9 AM to Noon where 3 PM is 0', fontsize=15)
plt.xlabel('Hillshade 9 AM')
plt.ylabel('Hillshade Noon')
plt.show()
There seems to be a negative correlation between Hillshade_9am and Hillshade_Noon when Hillshade_3pm is 0. We will fit a regression to predict the missing 3 PM values.
from sklearn.model_selection import train_test_split
data=train_data.copy()
cols=data.columns.tolist()
#Move Hillshade_3pm (column index 8) to the end so it becomes the regression target
cols = cols[:8] + cols[9:] + [cols[8]]
data=data[cols]
mask = data.Hillshade_3pm != 0
X_1, y_1 = data[mask].values[:, :-1], data[mask].values[:, -1:].ravel()
X_missing, y_missing = data[~mask].values[:, :-1], data[~mask].values[:, -1:].ravel()
X_train,X_test,y_train,y_test=train_test_split(X_1,y_1)
from sklearn.ensemble import GradientBoostingRegressor  #public import path (sklearn.ensemble.gradient_boosting is private)
#Fit a Gradient Boosted Regression Tree model to this dataset
gbrt=GradientBoostingRegressor(n_estimators=1000)
gbrt.fit(X_train,y_train)
print(train_data.Hillshade_3pm.loc[train_data.Hillshade_3pm == 0].count())
#Assign through .loc on the frame itself to avoid chained-assignment pitfalls
train_data.loc[train_data.Hillshade_3pm == 0, 'Hillshade_3pm'] = gbrt.predict(X_missing)
plt.scatter(train_data.Hillshade_3pm, train_data.Hillshade_Noon)
There seems to be no more issue.
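To back up that visual check, a small helper can confirm the imputed column is plausible. This is a sketch with a hypothetical `check_hillshade` helper; in the notebook it would be called on `train_data.Hillshade_3pm`.

```python
import pandas as pd

def check_hillshade(series: pd.Series) -> bool:
    """True if the column has no zeros left and stays in the 0-255 hillshade range."""
    return bool((series != 0).all() and series.between(0, 255).all())

print(check_hillshade(pd.Series([10, 200, 128])))  # a toy example -> True
```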
plt.hist(train_data['Cover_Type'], bins=30)
plt.show()
#Elevation
from scipy.stats import norm
import scipy.stats as stats
toplot = X['Elevation']
fig = plt.figure(figsize=(14,9))
ax1 = plt.subplot(221)
sns.distplot(toplot, hist = True, fit = norm, kde = True, kde_kws = {'shade': True, 'linewidth': 3})
ax2 = plt.subplot(222)
stats.probplot(toplot, plot=plt)
#Check Skewness and Kurtosis
print("Skewness: %f" % toplot.skew())
print("Kurtosis: %f" % toplot.kurt())
#Aspect
toplot = X['Aspect']
fig = plt.figure(figsize=(14,9))
ax1 = plt.subplot(221)
sns.distplot(toplot, hist = True, fit = norm, kde = True, kde_kws = {'shade': True, 'linewidth': 3})
ax2 = plt.subplot(222)
stats.probplot(toplot, plot=plt)
#Check Skewness and Kurtosis
print("Skewness: %f" % toplot.skew())
print("Kurtosis: %f" % toplot.kurt())
We notice that Aspect takes on negative values and values greater than 360. However, the azimuth is supposed to range from 0 to 360°. We can expect that negative values are measurement errors that should wrap around from 360, and values above 360 should wrap around from 0.
#Use .loc on the frames directly to avoid chained assignment
train_data.loc[train_data.Aspect < 0, 'Aspect'] += 360
train_data.loc[train_data.Aspect > 360, 'Aspect'] -= 360
test_data.loc[test_data.Aspect < 0, 'Aspect'] += 360
test_data.loc[test_data.Aspect > 360, 'Aspect'] -= 360
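The same wrap-around can be written in one step with a modulo, which may be less error-prone. A sketch on toy values; note that an exact 360 maps to 0 under `% 360`, unlike the two-step rule above.

```python
import pandas as pd

aspect = pd.Series([-30, 15, 370])
wrapped = aspect % 360  # wraps negatives up and values above 360 down
print(wrapped.tolist())  # [330, 15, 10]
```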
#Aspect
toplot = train_data['Aspect']
fig = plt.figure(figsize=(10,5))
sns.distplot(toplot, hist = True,kde = True, kde_kws = {'shade': True, 'linewidth': 3})
#Slope
toplot = X['Slope']
fig = plt.figure(figsize=(14,9))
ax1 = plt.subplot(221)
sns.distplot(toplot, hist = True, fit = norm, kde = True, kde_kws = {'shade': True, 'linewidth': 3})
ax2 = plt.subplot(222)
stats.probplot(toplot, plot=plt)
#Check Skewness and Kurtosis
print("Skewness: %f" % toplot.skew())
print("Kurtosis: %f" % toplot.kurt())
#Horizontal_Distance_To_Hydrology
toplot = X['Horizontal_Distance_To_Hydrology']
fig = plt.figure(figsize=(14,9))
ax1 = plt.subplot(221)
sns.distplot(toplot, hist = True, fit = norm, kde = True, kde_kws = {'shade': True, 'linewidth': 3})
ax2 = plt.subplot(222)
stats.probplot(toplot, plot=plt)
#Check Skewness and Kurtosis
print("Skewness: %f" % toplot.skew())
print("Kurtosis: %f" % toplot.kurt())
#Vertical_Distance_To_Hydrology
toplot = X['Vertical_Distance_To_Hydrology']
fig = plt.figure(figsize=(14,9))
ax1 = plt.subplot(221)
sns.distplot(toplot, hist = True, fit = norm, kde = True, kde_kws = {'shade': True, 'linewidth': 3})
ax2 = plt.subplot(222)
stats.probplot(toplot, plot=plt)
#Check Skewness and Kurtosis
print("Skewness: %f" % toplot.skew())
print("Kurtosis: %f" % toplot.kurt())
#Horizontal_Distance_To_Roadways
toplot = X['Horizontal_Distance_To_Roadways']
fig = plt.figure(figsize=(14,9))
ax1 = plt.subplot(221)
sns.distplot(toplot, hist = True, fit = norm, kde = True, kde_kws = {'shade': True, 'linewidth': 3})
ax2 = plt.subplot(222)
stats.probplot(toplot, plot=plt)
#Check Skewness and Kurtosis
print("Skewness: %f" % toplot.skew())
print("Kurtosis: %f" % toplot.kurt())
#Horizontal_Distance_To_Fire_Points
toplot = X['Horizontal_Distance_To_Fire_Points']
fig = plt.figure(figsize=(14,9))
ax1 = plt.subplot(221)
sns.distplot(toplot, hist = True, fit = norm, kde = True, kde_kws = {'shade': True, 'linewidth': 3})
ax2 = plt.subplot(222)
stats.probplot(toplot, plot=plt)
#Check Skewness and Kurtosis
print("Skewness: %f" % toplot.skew())
print("Kurtosis: %f" % toplot.kurt())
#Hillshade 9am
toplot = X['Hillshade_9am']
fig = plt.figure(figsize=(14,9))
ax1 = plt.subplot(221)
sns.distplot(toplot, hist = True, fit = norm, kde = True, kde_kws = {'shade': True, 'linewidth': 3})
ax2 = plt.subplot(222)
stats.probplot(toplot, plot=plt)
#Check Skewness and Kurtosis
print("Skewness: %f" % toplot.skew())
print("Kurtosis: %f" % toplot.kurt())
#Hillshade Noon
toplot = X['Hillshade_Noon']
fig = plt.figure(figsize=(14,9))
ax1 = plt.subplot(221)
sns.distplot(toplot, hist = True, fit = norm, kde = True, kde_kws = {'shade': True, 'linewidth': 3})
ax2 = plt.subplot(222)
stats.probplot(toplot, plot=plt)
#Check Skewness and Kurtosis
print("Skewness: %f" % toplot.skew())
print("Kurtosis: %f" % toplot.kurt())
#Hillshade_3pm
toplot = X['Hillshade_3pm']
fig = plt.figure(figsize=(14,9))
ax1 = plt.subplot(221)
sns.distplot(toplot, hist = True, fit = norm, kde = True, kde_kws = {'shade': True, 'linewidth': 3})
ax2 = plt.subplot(222)
stats.probplot(toplot, plot=plt)
#Check Skewness and Kurtosis
print("Skewness: %f" % toplot.skew())
print("Kurtosis: %f" % toplot.kurt())
For the One-Hot-Encoder, it might be more interesting to take a look at the number of 1 for each soil type. A Soil Type with a really small number of forests classified inside might lead to additional noise and could be dropped.
for i in range(10, train_data.shape[1]-1):
j = train_data.columns[i]
print (train_data[j].value_counts())
Soil_Type14 has only one observation classified in it.
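A small helper can list such near-empty one-hot columns automatically. This is a sketch with a hypothetical `rare_binary_columns` helper and a toy frame; in the notebook it would run on `train_data` with its `Soil_Type` columns.

```python
import pandas as pd

def rare_binary_columns(df: pd.DataFrame, prefix: str = 'Soil_Type', min_count: int = 10):
    """Return one-hot columns whose positive class appears fewer than min_count times."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    return [c for c in cols if df[c].sum() < min_count]

# Toy example: Soil_Type2 has a single positive observation
toy = pd.DataFrame({'Soil_Type1': [1, 1, 0, 1], 'Soil_Type2': [0, 0, 1, 0]})
print(rare_binary_columns(toy, min_count=2))  # ['Soil_Type2']
```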
from sklearn.ensemble import RandomForestClassifier
plt.figure(figsize=(10,10))
name = "Random Forest"
rf = RandomForestClassifier()
rf.fit(X,y)
classifier = rf
indices = np.argsort(rf.feature_importances_)[::-1][:40]
g = sns.barplot(y=X.columns[indices][:40],x = classifier.feature_importances_[indices][:40] , orient='h')
g.set_xlabel("Relative importance",fontsize=12)
g.set_ylabel("Features",fontsize=12)
g.tick_params(labelsize=9)
g.set_title("Feature importance")
As we can notice, the most important feature seems to be Elevation. This suggests that we should take a closer look at the distribution of elevation against the other columns, and focus our feature engineering on those relations.
#Elevation vs Aspect
plt.figure(figsize=(8,8))
plt.scatter(train_data.Elevation, train_data.Aspect, c=train_data.Cover_Type)
plt.xlabel("Elevation")
plt.ylabel("Aspect")
plt.show()
#Elevation vs Slope
plt.figure(figsize=(8,8))
plt.scatter(train_data.Elevation, train_data.Slope, c=train_data.Cover_Type)
plt.xlabel("Elevation")
plt.ylabel("Slope")
plt.show()
#Elevation vs Horizontal Distance Hydrology
plt.figure(figsize=(8,8))
plt.scatter(train_data.Elevation, train_data.Horizontal_Distance_To_Hydrology, c=train_data.Cover_Type)
plt.xlabel("Elevation")
plt.ylabel("Horizontal Hydrology")
plt.show()
#Elevation vs Vertical Distance Hydrology
plt.figure(figsize=(8,8))
plt.scatter(train_data.Elevation, train_data.Vertical_Distance_To_Hydrology, c=train_data.Cover_Type)
plt.xlabel("Elevation")
plt.ylabel("Vertical Hydrology")
plt.show()
#Elevation vs Horizontal Distance Roadways
plt.figure(figsize=(8,8))
plt.scatter(train_data.Elevation, train_data.Horizontal_Distance_To_Roadways, c=train_data.Cover_Type)
plt.xlabel("Elevation")
plt.ylabel("Roadways")
plt.show()
#Elevation vs Horizontal Distance Fire Points
plt.figure(figsize=(8,8))
plt.scatter(train_data.Elevation, train_data.Horizontal_Distance_To_Fire_Points, c=train_data.Cover_Type)
plt.xlabel("Elevation")
plt.ylabel("FirePoints")
plt.show()
#Elevation vs HillShade 9am
plt.figure(figsize=(8,8))
plt.scatter(train_data.Elevation, train_data.Hillshade_9am, c=train_data.Cover_Type)
plt.xlabel("Elevation")
plt.ylabel("Hillshade 9 AM")
plt.show()
#Elevation vs HillShade Noon
plt.figure(figsize=(8,8))
plt.scatter(train_data.Elevation, train_data.Hillshade_Noon, c=train_data.Cover_Type)
plt.xlabel("Elevation")
plt.ylabel("Hillshade Noon")
plt.show()
#Elevation vs HillShade 3pm
plt.figure(figsize=(8,8))
plt.scatter(train_data.Elevation, train_data.Hillshade_3pm, c=train_data.Cover_Type)
plt.xlabel("Elevation")
plt.ylabel("Hillshade 3 PM")
plt.show()
Most of the relations seem linear, which supports the development of tree-based methods. But it might be useful to adjust the relations for the horizontal and vertical distances to hydrology and the horizontal distance to roadways.
This step will be done in the Feature Engineering section.
I have created linear combinations of several columns and iterated to identify potentially interesting combinations. I also tried to normalize the columns, but it did not bring much to the models tested (even though it is sometimes a requirement).
Following the Approach section, we can compute the straight-line (slope) distance to hydrology from its horizontal and vertical components, and compute the mean distance to hydrology, fire points and roadways as a measure of interest.
train_data['slope_hydrology'] = np.sqrt((train_data['Vertical_Distance_To_Hydrology']*train_data['Vertical_Distance_To_Hydrology']) + (train_data['Horizontal_Distance_To_Hydrology']*train_data['Horizontal_Distance_To_Hydrology']))
train_data['mean_distance'] = (train_data['Horizontal_Distance_To_Hydrology'] + train_data['Horizontal_Distance_To_Fire_Points'] + train_data['Horizontal_Distance_To_Roadways'])/3
test_data['slope_hydrology'] = np.sqrt((test_data['Vertical_Distance_To_Hydrology']*test_data['Vertical_Distance_To_Hydrology']) + (test_data['Horizontal_Distance_To_Hydrology']*test_data['Horizontal_Distance_To_Hydrology']))
test_data['mean_distance'] = (test_data['Horizontal_Distance_To_Hydrology'] + test_data['Horizontal_Distance_To_Fire_Points'] + test_data['Horizontal_Distance_To_Roadways'])/3
I am also creating columns for combinations of the distances.
train_data['mean_1'] = (train_data['Horizontal_Distance_To_Hydrology'] + train_data['Horizontal_Distance_To_Fire_Points'])/2
train_data['mean_2'] = (train_data['Horizontal_Distance_To_Hydrology'] + train_data['Horizontal_Distance_To_Roadways'])/2
train_data['mean_3'] = (train_data['Horizontal_Distance_To_Roadways'] + train_data['Horizontal_Distance_To_Fire_Points'])/2
test_data['mean_1'] = (test_data['Horizontal_Distance_To_Hydrology'] + test_data['Horizontal_Distance_To_Fire_Points'])/2
test_data['mean_2'] = (test_data['Horizontal_Distance_To_Hydrology'] + test_data['Horizontal_Distance_To_Roadways'])/2
test_data['mean_3'] = (test_data['Horizontal_Distance_To_Roadways'] + test_data['Horizontal_Distance_To_Fire_Points'])/2
Finally, I have decided to create an average HillShade column :
train_data['avg_hillshade'] = (train_data['Hillshade_3pm'] + train_data['Hillshade_Noon'] + train_data['Hillshade_9am'])/3
test_data['avg_hillshade'] = (test_data['Hillshade_3pm'] + test_data['Hillshade_Noon'] + test_data['Hillshade_9am'])/3
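One quick way to gauge whether the engineered columns carry signal is their correlation with the target. This is a sketch with a hypothetical `target_correlations` helper on toy data; in the notebook it would be called as `target_correlations(train_data, 'Cover_Type', ['slope_hydrology', 'mean_distance', ...])`.

```python
import pandas as pd

def target_correlations(df: pd.DataFrame, target: str, features: list) -> pd.Series:
    """Correlation of each feature with the target, strongest (absolute) first."""
    return df[features].corrwith(df[target]).sort_values(key=abs, ascending=False)

# Toy frame: 'a' tracks the target perfectly, 'b' is its mirror image
toy = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [4, 3, 2, 1], 't': [1, 2, 3, 4]})
print(target_correlations(toy, 't', ['a', 'b']))
```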
For the Horizontal Distance To Hydrology :
#Elevation vs Horizontal Distance Hydrology
plt.figure(figsize=(8,8))
plt.scatter(train_data.Elevation, train_data.Horizontal_Distance_To_Hydrology, c=train_data.Cover_Type)
plt.xlabel("Elevation")
plt.ylabel("Horizontal Hydrology")
plt.show()
#Adjustement
plt.figure(figsize=(8,8))
plt.scatter(train_data.Elevation-0.17*train_data.Horizontal_Distance_To_Hydrology, train_data.Horizontal_Distance_To_Hydrology, c=train_data.Cover_Type)
plt.xlabel("Elevation")
plt.ylabel("Horizontal Hydrology")
plt.show()
For the Vertical Distance To Hydrology :
#Elevation vs Vertical Distance Hydrology
plt.figure(figsize=(8,8))
plt.scatter(train_data.Elevation, train_data.Vertical_Distance_To_Hydrology, c=train_data.Cover_Type)
plt.xlabel("Elevation")
plt.ylabel("Vertical Hydrology")
plt.show()
#Elevation vs Vertical Distance Hydrology
plt.figure(figsize=(8,8))
plt.scatter(train_data.Elevation-0.9*train_data.Vertical_Distance_To_Hydrology, train_data.Vertical_Distance_To_Hydrology, c=train_data.Cover_Type)
plt.xlabel("Elevation")
plt.ylabel("Vertical Hydrology")
plt.show()
For the Horizontal Distance To Roadways :
#Elevation vs Horizontal Distance Roadways
plt.figure(figsize=(8,8))
plt.scatter(train_data.Elevation, train_data.Horizontal_Distance_To_Roadways, c=train_data.Cover_Type)
plt.xlabel("Elevation")
plt.ylabel("Roadways")
plt.show()
#Elevation vs Horizontal Distance Roadways
plt.figure(figsize=(8,8))
plt.scatter(train_data.Elevation-0.03*train_data.Horizontal_Distance_To_Roadways, train_data.Horizontal_Distance_To_Roadways, c=train_data.Cover_Type)
plt.xlabel("Elevation")
plt.ylabel("Roadways")
plt.show()
Now that we have created this linearity, we just have to create the corresponding columns in the data frames.
train_data['Adjusted_HDH'] = train_data['Elevation'] - 0.17 * train_data['Horizontal_Distance_To_Hydrology']
test_data['Adjusted_HDH'] = test_data['Elevation'] - 0.17 * test_data['Horizontal_Distance_To_Hydrology']
train_data['Adjusted_VDH'] = train_data['Elevation'] - 0.9 * train_data['Vertical_Distance_To_Hydrology']
test_data['Adjusted_VDH'] = test_data['Elevation'] - 0.9 * test_data['Vertical_Distance_To_Hydrology']
train_data['Adjusted_HDR'] = train_data['Elevation'] - 0.03 * train_data['Horizontal_Distance_To_Roadways']
test_data['Adjusted_HDR'] = test_data['Elevation'] - 0.03 * test_data['Horizontal_Distance_To_Roadways']
The aim of this section is to look at some possible transformations of the data.
First, compute the cosine of the slope, as it is quite an important feature:
#Slope is given in degrees, so convert to radians before taking the cosine
train_data['Cos_slope'] = np.cos(np.radians(train_data['Slope']))
test_data['Cos_slope'] = np.cos(np.radians(test_data['Slope']))
Take a look at the square and the log of the Elevation:
train_data['Sqr_Elev'] = train_data['Elevation']*train_data['Elevation']
test_data['Sqr_Elev'] = test_data['Elevation']*test_data['Elevation']
train_data['Log_Elev'] = np.log(train_data['Elevation'])
test_data['Log_Elev'] = np.log(test_data['Elevation'])
y = train_data['Cover_Type']
X = train_data.loc[:, train_data.columns != 'Cover_Type']
To avoid testing models on 550'000 lines, we build train and test samples of 10'000 observations.
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42)
X_train.shape,X_test.shape,y_train.shape,y_test.shape
X_train = X_train.sample(n=10000, random_state=1, replace=True)
y_train = y_train.sample(n=10000, random_state=1, replace=True)
X_test = X_test.sample(n=10000, random_state=1, replace=True)
y_test = y_test.sample(n=10000, random_state=1, replace=True)
df_sample = train_data.sample(n=10000, random_state=1, replace=True)
Other methods flag observations based on measures such as the interquartile range. For example, if Q1 and Q3 are the lower and upper quartiles respectively, then one could define an outlier to be any observation outside the range:
[Q1-k(Q3-Q1),Q3+k(Q3-Q1)]
for some nonnegative constant k. John Tukey proposed this test, where k=1.5 flags an "outlier" and k=3 flags data that is "far out". As we did not notice any major outlier in the histograms above, we expect our outlier detection to return only a few outliers, or none at all.
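Tukey's fences on a toy sample make the k=3 rule concrete (a self-contained numeric sketch, independent of the dataset):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 100])
q1, q3 = np.percentile(x, [25, 75])   # 2.25 and 4.75 here
iqr = q3 - q1                          # 2.5
lo, hi = q1 - 3 * iqr, q3 + 3 * iqr   # the "far out" fences
print(lo, hi, x[(x < lo) | (x > hi)])  # -5.25 12.25 [100]
```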
from collections import Counter

def detect_outliers(df, n, features):
    """
    Takes a dataframe df of features and returns a list of the indices
    corresponding to the observations containing more than n outliers according
    to the Tukey method.
    """
    outlier_indices = []
    # iterate over features (columns)
    for col in features:
        # 1st quartile (25%)
        Q1 = np.percentile(df[col], 25)
        # 3rd quartile (75%)
        Q3 = np.percentile(df[col], 75)
        # Interquartile range (IQR)
        IQR = Q3 - Q1
        # outlier step (k=3, Tukey's "far out" fence)
        outlier_step = 3 * IQR
        # Determine a list of indices of outliers for feature col
        outlier_list_col = df[(df[col] < Q1 - outlier_step) | (df[col] > Q3 + outlier_step)].index
        # append the found outlier indices for col to the list of outlier indices
        outlier_indices.extend(outlier_list_col)
    # select observations containing more than n outliers
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = list(k for k, v in outlier_indices.items() if v > n)
    return multiple_outliers
# detect outliers in the ten quantitative columns
Outliers_to_drop = detect_outliers(train_data,2,["Elevation","Aspect","Slope","Horizontal_Distance_To_Hydrology", "Vertical_Distance_To_Hydrology", "Horizontal_Distance_To_Roadways", "Hillshade_9am", "Hillshade_Noon","Hillshade_3pm","Horizontal_Distance_To_Fire_Points" ])
print(Outliers_to_drop)
train_data.loc[Outliers_to_drop]
#X = X.drop(Outliers_to_drop, axis = 0).reset_index(drop=True)
#y = y.drop(Outliers_to_drop, axis = 0).reset_index(drop=True)
train_data = train_data.drop(Outliers_to_drop, axis = 0).reset_index(drop=True)
To further analyze our dataset, we can draw a Pairplot on the whole dataset excluding the One Hot Encoder and the Wilderness areas.
fig1 = (sns.PairGrid(df_sample.iloc[:, 0:10], height=2)
        .map_diag(sns.kdeplot, color='Blue', legend=False)
        .map_upper(plt.scatter, edgecolor="w", s=30)
        .map_lower(sns.kdeplot, shade=True, shade_lowest=False, cmap="mako", legend=False))
display(fig1)
#sns.pairplot(train_data.iloc[:,0:10])
The main linear technique for dimensionality reduction, principal component analysis, performs a linear mapping of the data to a lower-dimensional space in such a way that the variance of the data in the low-dimensional representation is maximized. In practice, the covariance (and sometimes the correlation) matrix of the data is constructed and the eigenvectors on this matrix are computed. The eigenvectors that correspond to the largest eigenvalues (the principal components) can now be used to reconstruct a large fraction of the variance of the original data. Moreover, the first few eigenvectors can often be interpreted in terms of the large-scale physical behavior of the system.
From : https://en.wikipedia.org/wiki/Dimensionality_reduction
rndperm = np.random.permutation(df_sample.shape[0])
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
pca_result = pca.fit_transform(df_sample.loc[:, df_sample.columns != 'Cover_Type'].values)
pca_cols = pd.DataFrame(df_sample)
pca_cols['pca-one'] = pca_result[:,0]
pca_cols['pca-two'] = pca_result[:,1]
pca_cols['pca-three'] = pca_result[:,2]
print('Explained variation per principal component: {}'.format(pca.explained_variance_ratio_))
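The cumulative sum of `explained_variance_ratio_` shows how much variance the first k components capture together. A sketch on synthetic data; in the notebook it would use the `pca` object fitted above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic stand-in for the sampled feature matrix
X_demo, _ = make_classification(n_samples=200, n_features=10, random_state=0)
pca_demo = PCA(n_components=3).fit(X_demo)
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print(cumulative)  # monotonically increasing, capped at 1.0
```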
#In 2D, using the (now unmaintained) ggplot package
from ggplot import *
chart = ggplot(pca_cols.loc[rndperm[:], :], aes(x='pca-one', y='pca-two', color='Cover_Type')) \
    + geom_point(size=75, alpha=0.8) \
    + ggtitle("First and Second Principal Components colored by Cover Type")
chart
# Or in 3D :
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(20,20))
ax = fig.add_subplot(projection='3d')  #fig.gca(projection='3d') was removed in recent matplotlib
ax.scatter(pca_cols['pca-one'], pca_cols['pca-two'], pca_cols['pca-three'], cmap="Set2_r", c=pca_cols['Cover_Type'])
# label the axes
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_zlabel("PC3")
ax.set_title("First, Second and Third Principal Components colored by Cover Type")
ax.view_init(10, 45)
plt.show()
The idea is to select models that tend to work best on a small yet representative data sample of 10'000 lines. This allows much quicker computations and lets us observe trends regarding which models work best.
This approach is naive: no parameters are tuned, we simply want to rank the models by the magnitude of their accuracy.
Finally, models are ordered.
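The ranking loop described here can be sketched as follows, on synthetic data and with an illustrative subset of the models; the notebook runs the individual models cell by cell instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the 10'000-row sample used in the notebook
X_demo, y_demo = make_classification(n_samples=500, n_features=20, n_informative=8,
                                     n_classes=3, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=42)

models = {'Naive Bayes': GaussianNB(),
          'Decision Tree': DecisionTreeClassifier(random_state=0),
          'Random Forest': RandomForestClassifier(random_state=0)}
# Fit each model with default parameters and score it on the held-out split
scores = {name: accuracy_score(yte, m.fit(Xtr, ytr).predict(Xte)) for name, m in models.items()}
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.3f}")
```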
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold #for K-fold cross validation
from sklearn.model_selection import cross_val_score #score evaluation
from sklearn.model_selection import cross_val_predict #prediction
kfold = KFold(n_splits=2, shuffle=True, random_state=23)  #random_state only takes effect (and is only allowed) with shuffle=True
In machine learning, naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features. Note that the multinomial variant of naive Bayes cannot handle negative values, which is why a min-max-scaled copy of the data is prepared below.
from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler()
scaled_df = scaler.fit_transform(df_sample)
scaled_df = pd.DataFrame(scaled_df, columns=df_sample.columns)
y_scaled = scaled_df['Cover_Type']
X_scaled = scaled_df.loc[:, scaled_df.columns != 'Cover_Type']
X_train_scaled,X_test_scaled,y_train_scaled,y_test_scaled = train_test_split(X_scaled,y_scaled,test_size=0.01,random_state=42)
X_train_scaled_sample = X_train_scaled.sample(n=10000, random_state=1, replace=True)
y_train_scaled_sample = y_train_scaled.sample(n=10000, random_state=1, replace=True)
X_test_scaled_sample = X_test_scaled.sample(n=10000, random_state=1, replace=True)
y_test_scaled_sample = y_test_scaled.sample(n=10000, random_state=1, replace=True)
df_scaled_sample = scaled_df.sample(n=10000, random_state=1, replace=True)
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
y_pred_gnb = gnb.fit(X_train, y_train).predict(X_test)
result_gnb = accuracy_score(y_test, y_pred_gnb)*100
print("Accuracy Model : %.2f%% " % result_gnb)
sns.heatmap(confusion_matrix(y_test,y_pred_gnb),annot=True,fmt='3.0f',cmap="coolwarm")
plt.title('Confusion_matrix', y=1.05, size=15)
Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function, which is the cumulative logistic distribution.
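To make the logistic function concrete, here is a minimal sketch (the `sigmoid` helper is written here for illustration, it is not part of scikit-learn):

```python
import numpy as np

def sigmoid(z):
    # Logistic function: squashes any real number into the interval (0, 1),
    # which logistic regression interprets as a class probability.
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))  # 0.5: the decision boundary
print(sigmoid(4.0))  # close to 1: a confident positive prediction
```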
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
# Train the model using the training sets and check score
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
result_lr = accuracy_score(y_test, y_pred_lr)*100
print("Accuracy Model : %.2f%% " % result_lr)
sns.heatmap(confusion_matrix(y_test,y_pred_lr),annot=True,fmt='3.0f',cmap="coolwarm")
plt.title('Confusion_matrix', y=1.05, size=15)
Decision tree learning uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves).
from sklearn import tree
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
dt = tree.DecisionTreeClassifier(random_state=0)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
result_dt = accuracy_score(y_test, y_pred_dt)*100
print("Accuracy Model : %.2f%% " % result_dt)
sns.heatmap(confusion_matrix(y_test,y_pred_dt),annot=True,fmt='3.0f',cmap="coolwarm")
plt.title('Confusion_matrix', y=1.05, size=15)
Random forest classifier creates a set of decision trees from randomly selected subset of training set. It then aggregates the votes from different decision trees to decide the final class of the test object.
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
result_rf = accuracy_score(y_test, y_pred_rf)*100
print("Accuracy Model : %.2f%% " % result_rf)
sns.heatmap(confusion_matrix(y_test,y_pred_rf),annot=True,fmt='3.0f',cmap="coolwarm")
plt.title('Confusion_matrix', y=1.05, size=15)
The main difference between random forests and extra trees (extremely randomized trees) is that, instead of computing the locally optimal feature/split combination (as the random forest does), extra trees pick a random threshold for each candidate feature.
This leads to more diversified trees and fewer splits to evaluate when training an extremely randomized forest.
from sklearn import ensemble
from sklearn.ensemble import ExtraTreesClassifier
extc = ExtraTreesClassifier()
extc.fit(X_train, y_train)
y_pred_extc = extc.predict(X_test)
result_extc = accuracy_score(y_test, y_pred_extc)*100
print("Accuracy Model : %.2f%% " % result_extc)
sns.heatmap(confusion_matrix(y_test,y_pred_extc),annot=True,fmt='3.0f',cmap='coolwarm')
plt.title('Confusion_matrix', y=1.05, size=15)
In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. However, k-NN usually performs poorly on high-dimensional problems such as this one, and generally requires the data to be scaled.
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 4)
knn.fit(X_train,y_train)
y_pred_knn = knn.predict(X_test)
result_knn = accuracy_score(y_test, y_pred_knn)*100
print("Accuracy Model : %.2f%% " % result_knn)
sns.heatmap(confusion_matrix(y_test,y_pred_knn),annot=True,fmt='3.0f',cmap="coolwarm")
plt.title('Confusion_matrix', y=1.05, size=15)
Gradient descent is an optimization algorithm used to find the values of the parameters (coefficients) of a function f that minimize a cost function.
Gradient descent is best used when the parameters cannot be calculated analytically (e.g. using linear algebra) and must be searched for by an optimization algorithm.
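A minimal one-dimensional sketch of the idea (the function, starting point, and step size are arbitrary choices for illustration):

```python
# Gradient descent on f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
# Repeatedly stepping against the gradient drives w toward the minimizer 3.
w = 0.0
learning_rate = 0.1
for _ in range(100):
    grad = 2.0 * (w - 3.0)
    w -= learning_rate * grad
print(w)  # approximately 3.0
```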
from sklearn.linear_model import SGDClassifier
gdc = SGDClassifier(penalty="l2", max_iter=5)  # max_iter=5 is deliberately low for speed and will likely underfit
gdc.fit(X_train, y_train)
y_pred_gdc = gdc.predict(X_test)
result_gdc = accuracy_score(y_test, y_pred_gdc)*100
print("Accuracy Model : %.2f%% " % result_gdc)
sns.heatmap(confusion_matrix(y_test,y_pred_gdc),annot=True,fmt='3.0f',cmap="coolwarm")
plt.title('Confusion_matrix', y=1.05, size=15)
Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(n_estimators=7, learning_rate=1.0, max_depth=5, random_state=0)
gbc.fit(X_train, y_train)
y_pred_gbc = gbc.predict(X_test)
result_gbc = accuracy_score(y_test, y_pred_gbc)*100
print("Accuracy Model : %.2f%% " % result_gbc)
sns.heatmap(confusion_matrix(y_test,y_pred_gbc),annot=True,fmt='3.0f',cmap="coolwarm")
plt.title('Confusion_matrix', y=1.05, size=15)
XGBoost is an algorithm that has recently been dominating applied machine learning for structured or tabular data. XGBoost is an implementation of gradient boosted decision trees designed for speed and performance.
from xgboost import XGBClassifier
xgb = XGBClassifier()
xgb.fit(X_train, y_train)
y_pred_xgb = xgb.predict(X_test)
result_xgb = accuracy_score(y_test, y_pred_xgb)*100
print("Accuracy Model : %.2f%% " % result_xgb)
sns.heatmap(confusion_matrix(y_test,y_pred_xgb),annot=True,fmt='3.0f',cmap="coolwarm")
plt.title('Confusion_matrix', y=1.05, size=15)
The output of the other learning algorithms ('weak learners') is combined into a weighted sum that represents the final output of the boosted classifier. AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous classifiers.
from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier()
ada.fit(X_train, y_train)
y_pred_ada = ada.predict(X_test)
result_ada = accuracy_score(y_test, y_pred_ada)*100
print("Accuracy Model : %.2f%% " % result_ada)
sns.heatmap(confusion_matrix(y_test,y_pred_ada),annot=True,fmt='3.0f',cmap="coolwarm")
plt.title('Confusion_matrix', y=1.05, size=15)
Given a set of training examples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier (although methods such as Platt scaling exist to use SVM in a probabilistic classification setting).
from sklearn.svm import SVC
svm = SVC()  # import SVC directly so the estimator does not shadow the sklearn.svm module
svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_test)
result_svm = accuracy_score(y_test, y_pred_svm)*100
print("Accuracy Model : %.2f%% " % result_svm)
sns.heatmap(confusion_matrix(y_test,y_pred_svm),annot=True,fmt='3.0f',cmap="coolwarm")
plt.title('Confusion_matrix', y=1.05, size=15)
Linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics, pattern recognition and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda= LinearDiscriminantAnalysis()
lda.fit(X_train,y_train)
y_pred_lda=lda.predict(X_test)
result_lda = accuracy_score(y_test, y_pred_lda)*100
print("Accuracy Model : %.2f%% " % result_lda)
sns.heatmap(confusion_matrix(y_test,y_pred_lda),annot=True,fmt='3.0f',cmap="coolwarm")
plt.title('Confusion_matrix', y=1.05, size=15)
Linear Discriminant Analysis can only learn linear boundaries, while Quadratic Discriminant Analysis can learn quadratic boundaries and is therefore more flexible.
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
qda= QuadraticDiscriminantAnalysis()
qda.fit(X_train,y_train)
y_pred_qda=qda.predict(X_test)
result_qda = accuracy_score(y_test, y_pred_qda)*100
print("Accuracy Model : %.2f%% " % result_qda)
sns.heatmap(confusion_matrix(y_test,y_pred_qda),annot=True,fmt='3.0f',cmap="coolwarm")
plt.title('Confusion_matrix', y=1.05, size=15)
An MLP consists of, at least, three layers of nodes: an input layer, a hidden layer and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. MLP utilizes a supervised learning technique called backpropagation for training.
from sklearn.neural_network import MLPClassifier
mlp= MLPClassifier()
mlp.fit(X_train,y_train)
y_pred_mlp=mlp.predict(X_test)
result_mlp = accuracy_score(y_test, y_pred_mlp)*100
print("Accuracy Model : %.2f%% " % result_mlp)
sns.heatmap(confusion_matrix(y_test,y_pred_mlp),annot=True,fmt='3.0f',cmap="coolwarm")
plt.title('Confusion_matrix', y=1.05, size=15)
models = pd.DataFrame({
'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
'Random Forest', 'Extra Trees','Naive Bayes', 'AdaBoostClassifier',
'Gradient Descent', 'Gradient Boosting', 'Linear Discriminant Analysis',
'Decision Tree', 'XGBoost', 'Quadratic Discriminant', 'Multi-Layer Perceptron'],
'Score': [result_svm, result_knn, result_lr,
result_rf, result_extc, result_gnb, result_ada,
result_gdc, result_gbc, result_lda, result_dt, result_xgb, result_qda, result_mlp]})
models = models.sort_values(by='Score',ascending=False)
models
plt.figure(figsize=(10,10))
g = sns.barplot(y=models['Model'],x = models['Score'] , orient='h')
g.set_xlabel("Accuracy Score",fontsize=12)
g.set_ylabel("Models",fontsize=12)
g.tick_params(labelsize=9)
g.set_title("Model selection on reduced sample")
plt.show()
Quite obviously, some models stand out:
A classic next step would be to build an ensemble model, for example with a Voting Classifier.
However, in this case, the extra trees classifier is essentially an extension of the random forest classifier, so combining the two in an ensemble would add little. We will therefore simply perform parameter tuning on the Extra Trees Classifier.
The parameter tuning is done essentially through a cross-validation approach, and by submitting the predictions to the Kaggle challenge several times to learn from our models.
A learning curve shows how the training and cross-validation scores evolve as the training set size increases.
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):
    """Generate a simple plot of the test and training learning curve"""
    from sklearn.model_selection import learning_curve
    plt.figure(figsize=(12,6))
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt
extc_param_grid = {'n_estimators' : [400],
'max_depth' : [400],
'warm_start': [True]
}
extc_gs = GridSearchCV(extc,param_grid = extc_param_grid, cv=4, scoring="accuracy", n_jobs= -1, verbose = 1)
extc_gs.fit(X_train,y_train)
# Best score
param_extc_gs = extc_gs.best_estimator_
result_extc_gs = extc_gs.best_score_*100
print("Accuracy CV : %.2f%% " % (result_extc_gs))
g = plot_learning_curve(param_extc_gs,"Extra Trees",X_train,y_train,cv=kfold)
plt.figure(figsize=(10,10))
name = "Extra Trees"
classifier = extc
indices = np.argsort(classifier.feature_importances_)[::-1][:40]
g = sns.barplot(y=X_train.columns[indices], x=classifier.feature_importances_[indices], orient='h')
g.set_xlabel("Relative importance",fontsize=12)
g.set_ylabel("Features",fontsize=12)
g.tick_params(labelsize=9)
g.set_title("Feature importance")
These parameters have been chosen after several cross validations. The results of the previous iterations are reported in the next section.
from sklearn import ensemble
from sklearn.ensemble import ExtraTreesClassifier
extc = ExtraTreesClassifier(max_features = len(X.columns)-1, n_estimators = 325, max_depth = 400, warm_start = True)
extc.fit(X, y)
y_pred_extc = extc.predict(test_data)
test_class = pd.Series(y_pred_extc, name="XTRA")
IDtest = pd.DataFrame(test_data.index.values)
results = pd.concat([IDtest,test_class],axis=1)
results.columns = ['Id', 'Cover_Type']
results.to_csv("ensemble_python_voting_xtra16.csv",index=False)
Once the classifier is trained, we can use it to make predictions on the test data. Below I record the optimal parameters from my previous cross validations, together with the corresponding prediction score on Kaggle.
extc_opt = ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
max_depth=150, max_features=53, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
oob_score=False, random_state=None, verbose=0, warm_start=False)
0.94900
extc_opt2 = ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
max_depth=150, max_features=53, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=1,
oob_score=False, random_state=None, verbose=0, warm_start=False)
0.95790
extc_opt3 = ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
max_depth=250, max_features=53, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=300, n_jobs=1,
oob_score=False, random_state=None, verbose=1, warm_start=True)
0.95803
extc_opt4 = ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
max_depth=350, max_features=53, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=400, n_jobs=1,
oob_score=False, random_state=None, verbose=0, warm_start=True)
0.95810
extc_opt5 = ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
max_depth=400, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=450, n_jobs=1,
oob_score=False, random_state=None, verbose=0, warm_start=True)
0.949
extc_opt6 = ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
max_depth=450, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=1,
oob_score=False, random_state=None, verbose=0, warm_start=True)
0.948
extc_opt7 = ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
max_depth=400, max_features=53, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=400, n_jobs=1,
oob_score=False, random_state=None, verbose=0, warm_start=True)
0.9577
extc_opt8 = extc_param_grid = {'max_features' : [len(X.columns)-1],
'n_estimators' : [325],
'max_depth' : [400],
'warm_start': [True]
}
0.95833
extc_opt9 = extc_param_grid = {'max_features' : [len(X.columns)-1],
'n_estimators' : [325],
'max_depth' : [600],
'warm_start': [True]
}
0.95811
extc_opt10 = extc_param_grid = {'max_features' : [len(X.columns)-1],
'n_estimators' : [310],
'max_depth' : [450],
'warm_start': [True]
}
0.95788
extc_opt11 = ExtraTreesClassifier(bootstrap=False, class_weight='balanced_subsample',
criterion='entropy', max_depth=400, max_features='auto',
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=450, n_jobs=1, oob_score=False, random_state=None,
verbose=0, warm_start=False)
0.94788
extc_opt12 = extc_param_grid = {'bootstrap' : [False],
'class_weight' : ['balanced_subsample'],
'criterion' : ['entropy'],
'max_depth' : [400],
'n_estimators' : [450],
'oob_score' : [False],
'random_state' : [None],
'warm_start' : [True]
}
0.95008
Not applied here, as the Extra Trees model is satisfactory on its own.
When applying the Voting Classifier:
from sklearn.ensemble import VotingClassifier
# param_rf_gs, param_ada_gs and param_svm_gs are assumed to be tuned estimators
# obtained from grid searches analogous to extc_gs above
votingC = VotingClassifier(estimators=[('rf_gs', param_rf_gs), ('extc_gs', param_extc_gs),
                                       ('ada_gs', param_ada_gs), ('svm_gs', param_svm_gs)],
                           voting='hard', n_jobs=-1)
votingC = votingC.fit(X, y)
test_class = pd.Series(votingC.predict(test_data), name="Voting")
IDtest = pd.Series(test_data.index, name='Id')  # pd.concat needs a Series, not a bare Index
results = pd.concat([IDtest, test_class], axis=1)
results.columns = ['Id', 'Cover_Type']
results.to_csv("ensemble_python_voting.csv",index=False)